Project work: A mini segmentation challenge
Students: Alexander Moor, Curdin Bosshart
Email: alexander.moor@students.fhnw.ch, curdin.bosshart@students.fhnw.ch
University: FHNW
Semester: 2nd semester
Date: DATE OF SUBMISSION
Abstract
$\Rightarrow$ A brief summary of your project in 2-3 sentences.
Table of contents
Prerequisites / Setup
$\Rightarrow$ Special setup instructions, imports and configurations go here.
# first approach (Frangi)
import cv2
import os
import glob
import numpy as np
import matplotlib.pyplot as plt
from skimage.filters import frangi
from skimage import img_as_float
import json
from PIL import Image, ImageDraw
# second approach (U-Net)
import json
import os
import shutil
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from collections import defaultdict
import numpy as np
from PIL import Image, ImageDraw
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
Dataset
The dataset used in this project is the "Spaghetti Detector" dataset from Roboflow, which contains images of a common 3D printing failure. This failure occurs when a print detaches from the print bed, causing the printer to continue extruding filament into the air, creating a tangled mess known as "spaghetti." These failures should be detected early, as they waste material and can lead to more serious issues. If unnoticed, the extruded filament can clog the nozzle, which in turn may pose a fire hazard due to overheating.
While the images in the dataset are not taken from live surveillance cameras, they resemble the kind of poor-quality footage typically captured by low-end webcams often used to monitor 3D printers. The print beds are usually moving, and lighting is uncontrolled, resulting in blurry, low-resolution, and inconsistent images. This degraded image quality is representative of what would be expected in real-world monitoring setups, which makes the dataset suitable for developing approaches that could later be applied to camera-based printer surveillance.
As an additional challenge, the print bed often includes grid lines or patterned surfaces that can interfere with segmentation by creating misleading edges or textures. In many cases, other objects such as cables, tools, or loose parts are also visible in the frame, cluttering the background and complicating the detection of the actual printing failure.
Preprocessing
Since two different approaches were used in this project, each required its own data preprocessing steps, as described below.
Manual segmentation
The software QuPath was used for the manual image segmentation. QuPath was chosen because it was one of the recommended options and proved the easiest to install on Linux. Eleven images were manually segmented without any preprocessing.
The dataset also includes annotated images, but the annotations are only rough boxes around the structures.
Automated Segmentation
First Approach Using the Frangi Filter
In this approach, the images were first sharpened using a custom kernel to enhance edges and highlight fine structures. This improved contrast at object boundaries and helped later processing steps. The sharpened images were then converted to grayscale to reduce complexity. A Frangi filter was applied to enhance curvilinear, line-like features common in the target structures. The Frangi response was thresholded to produce a binary mask, separating enhanced structures from the background. A circularity filter was used to remove irregular and elongated shapes, keeping mainly round ones and helping to exclude artifacts like grid lines. Morphological dilation was optionally applied to fill gaps and improve shape continuity.
For evaluation, manually labeled images were used. These annotations were made in QuPath and exported as GeoJSON files. They were loaded and converted to binary masks matching the dimensions of the input images. The predicted masks were compared to the manual ones using Intersection over Union (IoU), Dice coefficient, and pixel accuracy. These metrics quantify overlap and provide a measure of how well the automated method matches the manual segmentation.
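To make the annotation format concrete, here is a minimal, hypothetical GeoJSON structure of the kind exported by QuPath and consumed by the loading code below. The coordinates and field values are made up for illustration; real QuPath exports may carry additional properties.

```python
import json

# Hypothetical minimal annotation: a FeatureCollection with one Polygon whose
# coordinates are pixel positions (one outer ring, closed by repeating the
# first point). Values here are illustrative, not from the real dataset.
example = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[10, 10], [100, 10], [100, 80], [10, 80], [10, 10]]],
            },
            "properties": {},
        }
    ],
}

# Round-trip through JSON to mimic reading a .geojson file from disk
data = json.loads(json.dumps(example))

# The mask loader only needs the geometry type and the outer ring
geom = data["features"][0]["geometry"]
outer_ring = geom["coordinates"][0]
print(geom["type"], len(outer_ring))
```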
# --- Parameters ---
BASE_DIR = "data_frangi"
FRANGI_THRESHOLD = 0.2
MIN_CIRCULARITY = 0.2
DILATE_KERNEL_SIZE = 50 # set to 0 to disable dilation
# --- Helper Functions ---
def sharpen_image(img):
kernel = np.array([[0, -1, 0],
[-1, 5, -1],
[0, -1, 0]])
return cv2.filter2D(img, -1, kernel)
def show_comparison(imgs, titles, per_row=2):
n = len(imgs)
rows = (n + per_row - 1) // per_row
fig, axs = plt.subplots(rows, per_row, figsize=(5 * per_row, 5 * rows))
axs = axs.flatten()
for i in range(n):
ax = axs[i]
img = imgs[i]
if len(img.shape) == 2:
ax.imshow(img, cmap='gray')
else:
ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
ax.set_title(titles[i])
ax.axis('off')
for i in range(n, len(axs)):
axs[i].axis("off")
plt.tight_layout()
plt.show()
def filter_round_shapes(binary_mask, min_circularity=0.2):
contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
clean = np.zeros_like(binary_mask)
for c in contours:
area = cv2.contourArea(c)
perimeter = cv2.arcLength(c, True)
if perimeter == 0:
continue
circularity = 4 * np.pi * area / (perimeter ** 2)
if circularity > min_circularity:
cv2.drawContours(clean, [c], -1, 255, cv2.FILLED)
return clean
# --- Load GT mask from GeoJSON ---
def load_geojson_mask(index, shape):
path = os.path.join(BASE_DIR, f"{index}.geojson")
with open(path) as f:
data = json.load(f)
mask = Image.new('L', (shape[1], shape[0]), 0)
draw = ImageDraw.Draw(mask)
for feature in data['features']:
coords = feature['geometry']['coordinates']
# Handle both Polygon and MultiPolygon structures
if feature['geometry']['type'] == 'Polygon':
polys = [coords]
elif feature['geometry']['type'] == 'MultiPolygon':
polys = coords
else:
print(f"Skipping unsupported geometry type in {path}")
continue
for poly in polys:
try:
# Only first ring (outer boundary)
xy = [tuple(map(int, pt)) for pt in poly[0] if len(pt) == 2]
draw.polygon(xy, outline=255, fill=255)
except Exception as e:
print(f"Skipping polygon in {path}: {e}")
continue
return np.array(mask)
# --- Metric calculation ---
def evaluate_masks(pred, gt):
pred_bin = (pred > 0).astype(np.uint8)
gt_bin = (gt > 0).astype(np.uint8)
intersection = np.logical_and(pred_bin, gt_bin).sum()
union = np.logical_or(pred_bin, gt_bin).sum()
iou = intersection / union if union > 0 else 0
dice = 2 * intersection / (pred_bin.sum() + gt_bin.sum()) if (pred_bin.sum() + gt_bin.sum()) > 0 else 0
accuracy = (pred_bin == gt_bin).sum() / pred_bin.size
return {"IoU": iou, "Dice": dice, "Pixel Acc": accuracy}
# --- Main Processing ---
pattern = os.path.join(BASE_DIR, "*.jpg")
paths = sorted(glob.glob(pattern), key=lambda p: int(os.path.splitext(os.path.basename(p))[0]))
all_scores = []
for path in paths:
img = cv2.imread(path)
filename = os.path.basename(path)
index = int(os.path.splitext(filename)[0])
title = str(index)
# Step 1: Sharpen
sharp = sharpen_image(img)
# Step 2: Grayscale + Frangi
gray = cv2.cvtColor(sharp, cv2.COLOR_BGR2GRAY)
gray_float = img_as_float(gray)
frangi_response = frangi(gray_float)
# Step 3: Threshold
binary = (frangi_response > FRANGI_THRESHOLD).astype(np.uint8) * 255
# Step 4: Filter roundish shapes
filtered = filter_round_shapes(binary, min_circularity=MIN_CIRCULARITY)
# Step 5: Optional dilation (skipped when DILATE_KERNEL_SIZE is 0)
if DILATE_KERNEL_SIZE > 0:
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (DILATE_KERNEL_SIZE, DILATE_KERNEL_SIZE))
    filtered = cv2.dilate(filtered, kernel)
# Step 6: Show everything
gt_mask = load_geojson_mask(index, filtered.shape)
show_comparison(
[img,
gray,
(frangi_response * 255).astype(np.uint8),
binary,
filtered,
gt_mask],
[f"{title} - Original",
f"{title} - Sharpened + Grayscale",
f"{title} - Frangi Response",
f"{title} - Thresholded",
f"{title} - Final Mask",
f"{title} - Manual Mask"]
)
# Step 7: Evaluate
scores = evaluate_masks(filtered, gt_mask)
print(f"Image {index} - IoU: {scores['IoU']:.3f}, Dice: {scores['Dice']:.3f}, Accuracy: {scores['Pixel Acc']:.3f}")
all_scores.append(scores)
mean_iou = np.mean([s["IoU"] for s in all_scores])
mean_dice = np.mean([s["Dice"] for s in all_scores])
mean_acc = np.mean([s["Pixel Acc"] for s in all_scores])
print(f"\nMean IoU: {mean_iou:.3f}, Mean Dice: {mean_dice:.3f}, Mean Accuracy: {mean_acc:.3f}")
Image 1 - IoU: 0.586, Dice: 0.739, Accuracy: 0.951
Image 2 - IoU: 0.238, Dice: 0.385, Accuracy: 0.934
Image 3 - IoU: 0.430, Dice: 0.601, Accuracy: 0.863
Image 4 - IoU: 0.210, Dice: 0.348, Accuracy: 0.828
Image 5 - IoU: 0.639, Dice: 0.779, Accuracy: 0.854
Image 6 - IoU: 0.423, Dice: 0.595, Accuracy: 0.668
Image 7 - IoU: 0.537, Dice: 0.699, Accuracy: 0.861
Image 8 - IoU: 0.446, Dice: 0.617, Accuracy: 0.848
Image 9 - IoU: 0.425, Dice: 0.597, Accuracy: 0.958
Image 10 - IoU: 0.532, Dice: 0.694, Accuracy: 0.888
Image 11 - IoU: 0.515, Dice: 0.679, Accuracy: 0.930
Mean IoU: 0.453, Mean Dice: 0.612, Mean Accuracy: 0.871
Second Approach Using a Neural Network
Data Preparation
We define a custom PyTorch Dataset class called SpaghettiDataset to load pairs of images and their corresponding segmentation masks.
Each image is loaded from disk, converted to RGB format (color channels), and each mask is converted to L mode (grayscale, single channel).
We apply a resize transform to both images and masks, setting them to 512×512 pixels. This matches the expected input size of the neural network architecture that follows and ensures consistent dimensions across the dataset.
After loading, we normalize the image pixel values to the [0, 1] range by dividing by 255.
We then convert the image from shape (height, width, channels) to (channels, height, width) using .permute(2, 0, 1) because PyTorch expects the channel dimension first.
For the mask, we apply a binary threshold (> 0.5) to ensure it's treated as a foreground/background map, and we add an extra dimension using .unsqueeze(0) so it has shape (1, height, width), again matching the expected shape for the network.
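As a quick illustration of these shape conversions (toy 4×4 arrays, not project data):

```python
import numpy as np
import torch

# Image: (height, width, channels) -> (channels, height, width) for PyTorch
img = np.random.rand(4, 4, 3)
t = torch.tensor(img, dtype=torch.float32).permute(2, 0, 1)
print(t.shape)  # torch.Size([3, 4, 4])

# Mask: threshold to binary, then add a leading channel dimension
mask = (np.random.rand(4, 4) > 0.5).astype(np.float32)
m = torch.tensor(mask, dtype=torch.float32).unsqueeze(0)
print(m.shape)  # torch.Size([1, 4, 4])
```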
We randomly shuffle the full image list and split it into:
- 80% training
- 10% validation
- 10% test
Finally, we wrap these datasets in PyTorch DataLoader objects, which handle batching, shuffling, and parallel data loading for efficient training and evaluation.
resize_transform = transforms.Compose([
transforms.Resize((512, 512)),
])
class SpaghettiDataset(Dataset):
def __init__(self, image_dir, mask_dir, file_list, transform=None):
self.image_dir = image_dir
self.mask_dir = mask_dir
self.file_list = file_list
self.transform = transform
def __len__(self):
return len(self.file_list)
def __getitem__(self, idx):
img_name = self.file_list[idx]
img_path = os.path.join(self.image_dir, img_name)
mask_path = os.path.join(self.mask_dir, os.path.splitext(img_name)[0] + '_mask.png')
image = Image.open(img_path).convert('RGB')
mask = Image.open(mask_path).convert('L')
image = resize_transform(image)
mask = resize_transform(mask)
image = np.array(image) / 255.0
mask = np.array(mask) / 255.0
mask = (mask > 0.5).astype(np.float32)
image = torch.tensor(image, dtype=torch.float32).permute(2, 0, 1)
mask = torch.tensor(mask, dtype=torch.float32).unsqueeze(0)
return image, mask
# Directories
image_dir = 'images/unified_image_pool'
mask_dir = 'images/unified_segmentation_masks'
# Collect image filenames
all_files = [f for f in os.listdir(image_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
random.shuffle(all_files)
# Split into train/val/test
n = len(all_files)
train_files = all_files[:int(0.8 * n)]
val_files = all_files[int(0.8 * n):int(0.9 * n)]
test_files = all_files[int(0.9 * n):]
# Create datasets
train_dataset = SpaghettiDataset(image_dir, mask_dir, train_files)
val_dataset = SpaghettiDataset(image_dir, mask_dir, val_files)
test_dataset = SpaghettiDataset(image_dir, mask_dir, test_files)
# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)
print(f"Train: {len(train_dataset)} images")
print(f"Val: {len(val_dataset)} images")
print(f"Test: {len(test_dataset)} images")
Train: 572 images
Val: 71 images
Test: 72 images
Neural Network
After some brief research, we discovered U-Net, a purpose-built neural network architecture designed specifically for image segmentation. It was published by Ronneberger et al. in 2015 and has found widespread adoption in image processing. We decided to use it because it can reportedly be implemented essentially as presented in the original paper with good to excellent results. The image below was taken from the original publication and shows the architecture of the neural network, which was recreated in code with the help of ChatGPT.

How it works
Encoder
This part takes the big image and gradually compresses it to smaller and smaller feature maps.
- 3×3 convolutions: small pattern detectors that slide across the image, looking for things like edges, corners, or textures.
- Batch normalization: keeps the signal values stable, helping the network train faster and more reliably.
- ReLU activations: introduce nonlinearity, allowing the network to learn complex shapes and patterns.
- Max pooling: reduces the size of the image by keeping only the strongest signals, helping the network focus on the most important features and saving computation.
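The spatial shrinking through the encoder can be traced with a quick sketch (illustrative only; channel counts are ignored here, since pooling leaves them unchanged and only the convolutions alter them):

```python
import torch
import torch.nn as nn

# Four 2x2 max-pooling stages halve a 512x512 input four times,
# matching the downsampling path of the U-Net described above.
x = torch.randn(1, 3, 512, 512)  # (batch, channels, height, width)
pool = nn.MaxPool2d(2)

sizes = [x.shape[-1]]
for _ in range(4):
    x = pool(x)
    sizes.append(x.shape[-1])

print(sizes)  # [512, 256, 128, 64, 32]
```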
Bottleneck
At the bottom, after all the downsampling, the network has a compressed feature map containing the most abstract, high-level understanding of the input image.
Decoder
This part takes the compressed features and rebuilds them back to the original image size.
- Transposed convolutions (learned upsampling): instead of just stretching the image, the network learns how to fill in new pixels so they best match the patterns it saw during training.
- Skip connections: bring back detailed features (like edges and fine textures) from the encoder, helping the decoder make precise pixel predictions.
The final 1×1 convolution
This takes the last expanded feature map and converts it into a single value per pixel, often interpreted as a probability: "how likely is this pixel part of the object we're segmenting?"
class DoubleConv(nn.Module):
def __init__(self, in_channels, out_channels):
super(DoubleConv, self).__init__()
self.double_conv = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True)
)
def forward(self, x):
return self.double_conv(x)
class UNet(nn.Module):
def __init__(self, in_channels=3, out_channels=1):
super(UNet, self).__init__()
self.down1 = DoubleConv(in_channels, 64)
self.pool1 = nn.MaxPool2d(2)
self.down2 = DoubleConv(64, 128)
self.pool2 = nn.MaxPool2d(2)
self.down3 = DoubleConv(128, 256)
self.pool3 = nn.MaxPool2d(2)
self.down4 = DoubleConv(256, 512)
self.pool4 = nn.MaxPool2d(2)
self.bottleneck = DoubleConv(512, 1024)
self.up4 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
self.conv4 = DoubleConv(1024, 512)
self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
self.conv3 = DoubleConv(512, 256)
self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
self.conv2 = DoubleConv(256, 128)
self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
self.conv1 = DoubleConv(128, 64)
self.out = nn.Conv2d(64, out_channels, kernel_size=1)
def forward(self, x):
d1 = self.down1(x)
p1 = self.pool1(d1)
d2 = self.down2(p1)
p2 = self.pool2(d2)
d3 = self.down3(p2)
p3 = self.pool3(d3)
d4 = self.down4(p3)
p4 = self.pool4(d4)
bn = self.bottleneck(p4)
up4 = self.up4(bn)
merge4 = torch.cat([up4, d4], dim=1)
c4 = self.conv4(merge4)
up3 = self.up3(c4)
merge3 = torch.cat([up3, d3], dim=1)
c3 = self.conv3(merge3)
up2 = self.up2(c3)
merge2 = torch.cat([up2, d2], dim=1)
c2 = self.conv2(merge2)
up1 = self.up1(c2)
merge1 = torch.cat([up1, d1], dim=1)
c1 = self.conv1(merge1)
out = self.out(c1)
return out
Initialization
We initialize the U-Net model, specifying 3 input channels (for RGB images) and 1 output channel (for the binary segmentation mask). The model is moved to the appropriate device, using a GPU (cuda) if available, or falling back to the CPU.
# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(in_channels=3, out_channels=1).to(device)
Loss Function
For training, we use the binary cross-entropy with logits loss (BCEWithLogitsLoss), which combines a sigmoid activation with binary cross-entropy loss in a single, numerically stable function.
This is appropriate because our task is a per-pixel binary classification: each pixel in the output mask must be classified as either foreground (object) or background (non-object).
Since the model outputs a raw score for each pixel, we need to apply a sigmoid function to convert it to a probability between 0 and 1. BCEWithLogitsLoss handles this internally and computes the correct loss between the predicted probabilities and the ground truth binary mask.
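A small sanity check of this equivalence (illustrative only, not part of the training code):

```python
import torch
import torch.nn as nn

# BCEWithLogitsLoss on raw scores ("logits") matches applying sigmoid first
# and then BCELoss, but is numerically more stable for large-magnitude logits.
logits = torch.tensor([[2.0, -1.0, 0.5]])
target = torch.tensor([[1.0, 0.0, 1.0]])

loss_fused = nn.BCEWithLogitsLoss()(logits, target)
loss_split = nn.BCELoss()(torch.sigmoid(logits), target)

print(torch.allclose(loss_fused, loss_split))  # True
```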
Optimizer
We optimize the model using the Adam optimizer, which is a popular choice because it automatically adjusts the learning rate for each parameter based on the history of its gradients.
We set the base learning rate to 0.0001, which controls how large each parameter update step is. Compared to simpler optimizers like stochastic gradient descent (SGD), Adam often leads to faster and more stable convergence.
# Loss function
criterion = nn.BCEWithLogitsLoss()
# Optimizer
optimizer = optim.Adam(model.parameters(), lr=0.0001)
Training
We train the model over multiple epochs, where each epoch means the model has seen the entire training dataset once.
We set a large maximum number of epochs (100), but we also implement early stopping: this means we monitor the validation loss after each epoch, and if it stops improving for a fixed number of consecutive epochs (here, patience = 5), we stop training early to avoid overfitting and wasting time.
During each epoch:
- We run the model in training mode, compute the loss on each batch of training images, backpropagate the gradients, and update the model weights.
- After finishing all training batches, we evaluate the model on the validation set (without gradient updates) to track how well it generalizes.
- We calculate the average training and validation loss per epoch and print them for monitoring.
If the validation loss improves, we reset the patience counter and save the model checkpoint (best_model.pth). If the validation loss stops improving, we count the number of epochs without improvement, and once it reaches the patience threshold, we stop the loop.
This approach helps prevent overfitting and ensures we keep the best-performing model from training.
num_epochs = 100
best_val_loss = float('inf')
patience = 5
epochs_without_improvement = 0
for epoch in range(num_epochs):
model.train()
train_loss = 0.0
for images, masks in train_loader:
images = images.to(device)
masks = masks.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, masks)
loss.backward()
optimizer.step()
train_loss += loss.item() * images.size(0)
avg_train_loss = train_loss / len(train_loader.dataset)
# Validation
model.eval()
val_loss = 0.0
with torch.no_grad():
for images, masks in val_loader:
images = images.to(device)
masks = masks.to(device)
outputs = model(images)
loss = criterion(outputs, masks)
val_loss += loss.item() * images.size(0)
avg_val_loss = val_loss / len(val_loader.dataset)
print(f"Epoch [{epoch + 1}/{num_epochs}] "
f"Train Loss: {avg_train_loss:.4f} "
f"Val Loss: {avg_val_loss:.4f}")
# Early stopping check:
if avg_val_loss < best_val_loss:
best_val_loss = avg_val_loss
epochs_without_improvement = 0
torch.save(model.state_dict(), 'best_model.pth')
else:
epochs_without_improvement += 1
if epochs_without_improvement >= patience:
print(f"Early stopping triggered after {epoch + 1} epochs.")
break
Epoch [1/100] Train Loss: 0.4976 Val Loss: 0.5171
Epoch [2/100] Train Loss: 0.4435 Val Loss: 0.4596
Epoch [3/100] Train Loss: 0.4235 Val Loss: 0.4730
Epoch [4/100] Train Loss: 0.4008 Val Loss: 0.4172
Epoch [5/100] Train Loss: 0.3893 Val Loss: 0.4075
Epoch [6/100] Train Loss: 0.3757 Val Loss: 0.3883
Epoch [7/100] Train Loss: 0.3720 Val Loss: 0.3826
Epoch [8/100] Train Loss: 0.3634 Val Loss: 0.4084
Epoch [9/100] Train Loss: 0.3534 Val Loss: 0.3875
Epoch [10/100] Train Loss: 0.3548 Val Loss: 0.3680
Epoch [11/100] Train Loss: 0.3551 Val Loss: 0.4333
Epoch [12/100] Train Loss: 0.3568 Val Loss: 0.4223
Epoch [13/100] Train Loss: 0.3454 Val Loss: 0.3535
Epoch [14/100] Train Loss: 0.3458 Val Loss: 0.3858
Epoch [15/100] Train Loss: 0.3366 Val Loss: 0.3503
Epoch [16/100] Train Loss: 0.3374 Val Loss: 0.3573
Epoch [17/100] Train Loss: 0.3350 Val Loss: 0.3621
Epoch [18/100] Train Loss: 0.3318 Val Loss: 0.3666
Epoch [19/100] Train Loss: 0.3316 Val Loss: 0.3512
Epoch [20/100] Train Loss: 0.3243 Val Loss: 0.5116
Early stopping triggered after 20 epochs.
import os
import json
import random
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image, ImageDraw
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
resize_transform = transforms.Compose([
transforms.Resize((512, 512)),
])
class SpaghettiDataset(Dataset):
def __init__(self, image_dir, mask_dir, file_list, transform=None):
self.image_dir = image_dir
self.mask_dir = mask_dir
self.file_list = file_list
self.transform = transform
def __len__(self):
return len(self.file_list)
def __getitem__(self, idx):
img_name = self.file_list[idx]
img_path = os.path.join(self.image_dir, img_name)
mask_path = os.path.join(self.mask_dir, os.path.splitext(img_name)[0] + '_mask.png')
image = Image.open(img_path).convert('RGB')
mask = Image.open(mask_path).convert('L')
image = resize_transform(image)
mask = resize_transform(mask)
image = np.array(image) / 255.0
mask = np.array(mask) / 255.0
mask = (mask > 0.5).astype(np.float32)
image = torch.tensor(image, dtype=torch.float32).permute(2, 0, 1)
mask = torch.tensor(mask, dtype=torch.float32).unsqueeze(0)
return image, mask
# -------- Load geojson mask without rasterio --------
def load_geojson_mask(geojson_path, target_shape):
"""
Load a geojson file and rasterize it into a binary mask array (without rasterio).
:param geojson_path: Path to the .geojson file
:param target_shape: (height, width) tuple
:return: numpy array (0 or 1) of shape target_shape
"""
mask_img = Image.new('L', (target_shape[1], target_shape[0]), 0) # width, height
draw = ImageDraw.Draw(mask_img)
with open(geojson_path, 'r') as f:
geojson_data = json.load(f)
for feature in geojson_data['features']:
geom_type = feature['geometry']['type']
coords_root = feature['geometry']['coordinates']
if geom_type == 'Polygon':
polygons = [coords_root]
elif geom_type == 'MultiPolygon':
polygons = coords_root
else:
continue # skip unsupported types
for polygon in polygons:
    exterior = polygon[0]  # outer ring only
    # Draw polygon (note: assumes coordinates are already scaled to image size)
    xy = [tuple(map(int, pt[:2])) for pt in exterior]
    draw.polygon(xy, outline=1, fill=1)
return np.array(mask_img)
# -------- Dataset class --------
class HandSegmentedDataset(Dataset):
def __init__(self, image_dir, geojson_dir, file_list, transform=None):
self.image_dir = image_dir
self.geojson_dir = geojson_dir
self.file_list = file_list
self.transform = transform
def __len__(self):
return len(self.file_list)
def __getitem__(self, idx):
img_name = self.file_list[idx]
img_path = os.path.join(self.image_dir, img_name)
geojson_path = os.path.join(self.geojson_dir, os.path.splitext(img_name)[0] + '.geojson')
# Load and resize image
image = Image.open(img_path).convert('RGB')
image_resized = resize_transform(image)
image_array = np.array(image_resized) / 255.0
# Load and rasterize mask
original_shape = image.size[::-1] # (height, width)
gt_mask_array = load_geojson_mask(geojson_path, original_shape)
# Resize mask to match transformed image size
gt_mask_pil = Image.fromarray(gt_mask_array.astype(np.uint8) * 255)
gt_mask_resized = resize_transform(gt_mask_pil)
mask_array = np.array(gt_mask_resized) / 255.0
mask_array = (mask_array > 0.5).astype(np.float32)
# Convert to tensors
image_tensor = torch.tensor(image_array, dtype=torch.float32).permute(2, 0, 1)
mask_tensor = torch.tensor(mask_array, dtype=torch.float32).unsqueeze(0)
return image_tensor, mask_tensor
# Directories
image_dir = 'images/unified_image_pool'
mask_dir = 'images/unified_segmentation_masks'
hand_image_dir = 'data_frangi/'
hand_geojson_dir = 'data_frangi/'
hand_files = [f for f in os.listdir(hand_image_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
hand_dataset = HandSegmentedDataset(hand_image_dir, hand_geojson_dir, hand_files)
hand_loader = DataLoader(hand_dataset, batch_size=1, shuffle=False)
# Collect image filenames
all_files = [f for f in os.listdir(image_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
random.shuffle(all_files)
# Split into train/val/test
n = len(all_files)
train_files = all_files[:int(0.8 * n)]
val_files = all_files[int(0.8 * n):int(0.9 * n)]
test_files = all_files[int(0.9 * n):]
# Create datasets
train_dataset = SpaghettiDataset(image_dir, mask_dir, train_files)
val_dataset = SpaghettiDataset(image_dir, mask_dir, val_files)
test_dataset = SpaghettiDataset(image_dir, mask_dir, test_files)
# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)
class DoubleConv(nn.Module):
def __init__(self, in_channels, out_channels):
super(DoubleConv, self).__init__()
self.double_conv = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True),
nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
nn.BatchNorm2d(out_channels),
nn.ReLU(inplace=True)
)
def forward(self, x):
return self.double_conv(x)
class UNet(nn.Module):
def __init__(self, in_channels=3, out_channels=1):
super(UNet, self).__init__()
self.down1 = DoubleConv(in_channels, 64)
self.pool1 = nn.MaxPool2d(2)
self.down2 = DoubleConv(64, 128)
self.pool2 = nn.MaxPool2d(2)
self.down3 = DoubleConv(128, 256)
self.pool3 = nn.MaxPool2d(2)
self.down4 = DoubleConv(256, 512)
self.pool4 = nn.MaxPool2d(2)
self.bottleneck = DoubleConv(512, 1024)
self.up4 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
self.conv4 = DoubleConv(1024, 512)
self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
self.conv3 = DoubleConv(512, 256)
self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
self.conv2 = DoubleConv(256, 128)
self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
self.conv1 = DoubleConv(128, 64)
self.out = nn.Conv2d(64, out_channels, kernel_size=1)
def forward(self, x):
d1 = self.down1(x)
p1 = self.pool1(d1)
d2 = self.down2(p1)
p2 = self.pool2(d2)
d3 = self.down3(p2)
p3 = self.pool3(d3)
d4 = self.down4(p3)
p4 = self.pool4(d4)
bn = self.bottleneck(p4)
up4 = self.up4(bn)
merge4 = torch.cat([up4, d4], dim=1)
c4 = self.conv4(merge4)
up3 = self.up3(c4)
merge3 = torch.cat([up3, d3], dim=1)
c3 = self.conv3(merge3)
up2 = self.up2(c3)
merge2 = torch.cat([up2, d2], dim=1)
c2 = self.conv2(merge2)
up1 = self.up1(c2)
merge1 = torch.cat([up1, d1], dim=1)
c1 = self.conv1(merge1)
out = self.out(c1)
return out
# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(in_channels=3, out_channels=1).to(device)
# Metric functions
def dice_coefficient(pred, target):
pred = pred.float()
target = target.float()
intersection = (pred * target).sum()
return (2. * intersection) / (pred.sum() + target.sum() + 1e-8)
def iou_score(pred, target):
pred = pred.float()
target = target.float()
intersection = (pred * target).sum()
union = pred.sum() + target.sum() - intersection
return intersection / (union + 1e-8)
def pixel_accuracy(pred, target):
correct = (pred == target).sum()
total = torch.numel(pred)
return correct.float() / total
Important! Run the collapsed cell above if you are doing only evaluation and have not executed all cells above. Otherwise the model cannot be loaded.
model.load_state_dict(torch.load('best_model.pth'))
# read the markdown cell above if you are getting an error :)
<All keys matched successfully>
Visual Comparison
We evaluate the trained model on the test set by running it in inference mode (using torch.no_grad() to disable gradient tracking).
We load the test images and pass them through the model to get predicted masks.
The raw model outputs are converted to binary masks by applying a sigmoid activation (to get pixelwise probabilities) and thresholding at 0.5.
For each test image, we plot:
- The original input image.
- The ground truth mask (from the labeled data).
- The predicted mask overlaid on top of the original image.
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
# Load all test indices and shuffle
all_indices = list(range(len(test_dataset)))
random.shuffle(all_indices)
max_plots = 5
plot_count = 0
with torch.no_grad():
for idx in all_indices:
image, mask = test_dataset[idx]
image = image.unsqueeze(0).to(device) # Add batch dimension
output = model(image)
pred = torch.sigmoid(output) > 0.5 # Binary mask
img = image[0].cpu().permute(1, 2, 0).numpy()
gt_mask = mask.cpu().squeeze().numpy()
pred_mask = pred[0].cpu().squeeze().numpy()
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.title('Original Image')
plt.imshow(img)
plt.axis('off')
plt.subplot(1, 3, 2)
plt.title('Ground Truth Mask')
plt.imshow(gt_mask, cmap='gray')
plt.axis('off')
plt.subplot(1, 3, 3)
plt.title('Predicted Mask Overlay')
plt.imshow(img)
plt.imshow(pred_mask, cmap='Reds', alpha=0.5)
plt.axis('off')
plt.show()
plot_count += 1
if plot_count >= max_plots:
break
Based on this limited sample of image versus binary segmentation mask versus model prediction, it appears that the model can quite accurately identify the spaghettified areas of a 3D print. We will now quantify this over the whole dataset.
Metrics
To quantify the accuracy of the model, we use three standard image segmentation metrics:
Dice Coefficient
This metric measures the overlap between the predicted segmentation mask and the ground truth mask. It ranges from 0 (no overlap) to 1 (perfect match). Mathematically, it is defined as:
$$ \text{Dice} = \frac{2 \times |A \cap B|}{|A| + |B|} $$
where $A$ is the set of predicted positive pixels and $B$ is the set of ground truth positive pixels.
Intersection over Union (IoU)
Also known as the Jaccard Index, this metric calculates the ratio between the intersection and the union of the predicted and ground truth masks. It provides a more conservative estimate of overlap and is defined as:
$$ \text{IoU} = \frac{|A \cap B|}{|A \cup B|} $$
Pixel Accuracy
This metric measures the proportion of correctly classified pixels over the total number of pixels in the image. It gives an overall accuracy score without focusing specifically on the shape or boundaries of the segments.
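As a toy illustration of how these three metrics relate (made-up 4×4 masks, not project data):

```python
import numpy as np

# Prediction covers 4 pixels, ground truth covers 6; they overlap on 4.
pred = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=bool)
gt   = np.array([[1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]], dtype=bool)

inter = np.logical_and(pred, gt).sum()      # 4 overlapping pixels
union = np.logical_or(pred, gt).sum()       # 6 pixels in either mask
dice = 2 * inter / (pred.sum() + gt.sum())  # 8 / 10 = 0.8
iou = inter / union                         # 4 / 6 ~= 0.667
acc = (pred == gt).sum() / pred.size        # 14 / 16 = 0.875

print(dice, round(iou, 3), acc)
```

Note that Dice is always at least as large as IoU, which is why the two are reported together.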
In the evaluation, we loop over all images in the test dataset, compute these metrics for each image, and then calculate the mean and standard deviation across the entire dataset. This provides a summary of the model's average performance and the variability of its predictions.
# Set model to eval mode
model.eval()

# Accumulators
dice_scores = []
iou_scores = []
accuracies = []

with torch.no_grad():
    for images, masks in test_loader:
        images = images.to(device)
        masks = masks.to(device)

        outputs = model(images)
        preds = torch.sigmoid(outputs) > 0.5  # Binary mask

        # For each image in the batch
        for i in range(images.size(0)):
            pred_mask = preds[i]
            gt_mask = masks[i]

            dice = dice_coefficient(pred_mask, gt_mask)
            iou = iou_score(pred_mask, gt_mask)
            acc = pixel_accuracy(pred_mask, gt_mask)

            dice_scores.append(dice.item())
            iou_scores.append(iou.item())
            accuracies.append(acc.item())

# Compute averages and standard deviations
avg_dice = np.mean(dice_scores)
std_dice = np.std(dice_scores)
avg_iou = np.mean(iou_scores)
std_iou = np.std(iou_scores)
avg_acc = np.mean(accuracies)
std_acc = np.std(accuracies)

print("Model prediction versus presegmented dataset")
print(f"Dice Coefficient: Mean = {avg_dice:.4f}, standard deviation = {std_dice:.4f}")
print(f"IoU Score: Mean = {avg_iou:.4f}, standard deviation = {std_iou:.4f}")
print(f"Pixel Accuracy: Mean = {avg_acc:.4f}, standard deviation = {std_acc:.4f}")
Model prediction versus presegmented dataset
Dice Coefficient: Mean = 0.5161, standard deviation = 0.2947
IoU Score: Mean = 0.3979, standard deviation = 0.2555
Pixel Accuracy: Mean = 0.8696, standard deviation = 0.0875
The combination of moderately low Dice and IoU scores with a relatively high pixel accuracy confirms what was already noticeable in the visualizations: the segmentation masks used for training were large rectangles that often covered more area than the actual spaghetti. A low overlap between the model prediction and these rectangular masks is therefore expected, assuming the model has learned the correct patterns.
For a more accurate comparison, we now apply the same metrics to the ten manually segmented images. We expect the overlap metrics to improve substantially, since these masks follow the true shape of the spaghetti.
dice_scores = []
iou_scores = []
accuracies = []

model.eval()
with torch.no_grad():
    for images, masks in hand_loader:
        images = images.to(device)
        masks = masks.to(device)

        outputs = model(images)
        preds = torch.sigmoid(outputs) > 0.5

        pred_mask = preds[0]
        gt_mask = masks[0]

        dice = dice_coefficient(pred_mask, gt_mask)
        iou = iou_score(pred_mask, gt_mask)
        acc = pixel_accuracy(pred_mask, gt_mask)

        dice_scores.append(dice.item())
        iou_scores.append(iou.item())
        accuracies.append(acc.item())

# -------- Report --------
avg_dice = np.mean(dice_scores)
std_dice = np.std(dice_scores)
avg_iou = np.mean(iou_scores)
std_iou = np.std(iou_scores)
avg_acc = np.mean(accuracies)
std_acc = np.std(accuracies)

print("Model prediction versus hand segmented dataset (n=10)")
print(f"Dice Coefficient: Mean = {avg_dice:.4f}, Std = {std_dice:.4f}")
print(f"IoU Score: Mean = {avg_iou:.4f}, Std = {std_iou:.4f}")
print(f"Pixel Accuracy: Mean = {avg_acc:.4f}, Std = {std_acc:.4f}")
Model prediction versus hand segmented dataset (n=10)
Dice Coefficient: Mean = 0.6256, Std = 0.2262
IoU Score: Mean = 0.4880, Std = 0.2041
Pixel Accuracy: Mean = 0.9154, Std = 0.0376
As expected, pixel accuracy is only slightly higher against the accurately segmented images, while the overlap metrics improve much more substantially.
# Load all indices from hand-segmented dataset and shuffle
all_indices = list(range(len(hand_dataset)))
random.shuffle(all_indices)

max_plots = 5
plot_count = 0

model.eval()
with torch.no_grad():
    for idx in all_indices:
        image, mask = hand_dataset[idx]
        image = image.unsqueeze(0).to(device)  # Add batch dimension

        output = model(image)
        pred = torch.sigmoid(output) > 0.5  # Binary mask

        img = image[0].cpu().permute(1, 2, 0).numpy()
        gt_mask = mask.cpu().squeeze().numpy()
        pred_mask = pred[0].cpu().squeeze().numpy()

        plt.figure(figsize=(12, 4))

        plt.subplot(1, 3, 1)
        plt.title('Original Image')
        plt.imshow(img)
        plt.axis('off')

        plt.subplot(1, 3, 2)
        plt.title('Ground Truth Mask (GeoJSON)')
        plt.imshow(gt_mask, cmap='gray')
        plt.axis('off')

        plt.subplot(1, 3, 3)
        plt.title('Predicted Mask Overlay')
        plt.imshow(img)
        plt.imshow(pred_mask, cmap='Reds', alpha=0.5)
        plt.axis('off')

        plt.show()

        plot_count += 1
        if plot_count >= max_plots:
            break
These samples agree with the earlier claim that the overlap metrics depend strongly on the type of segmentation mask the predictions are compared against.
Discussion¶
$\Rightarrow$ Briefly discuss your results and share your key observations, experiences, and learnings.